Associative Clustering by Maximizing a Bayes Factor

Authors

  • Janne Sinkkonen
  • Janne Nikkilä
  • Leo Lahti
  • Samuel Kaski
Abstract

Clustering by maximizing the dependency between (margin) groupings or partitionings of co-occurring data pairs is studied. We suggest a probabilistic criterion that generalizes discriminative clustering (DC), an extension of the information bottleneck (IB) principle to labeled continuous data. The criterion is the Bayes factor between models assuming dependence and independence of the two cluster sets, and it can be used as a well-founded criterion for IB for small data sets. With suitable prior assumptions the Bayes factor is equivalent to the hypergeometric probability of a contingency table with the optimized clusters at the margins, and for large data it becomes the standard mutual information. An algorithm for two-margin clustering of paired continuous data, associative clustering (AC), is introduced. Genes are clustered to find dependencies between gene expression and transcription factor binding, and dependencies between expression in different organisms.
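The relationship between the Bayes factor of a contingency table and mutual information can be illustrated with a small sketch. It assumes a generic symmetric Dirichlet prior (hyperparameter `alpha`, set to 1 here) on the cell probabilities for the dependence model and on the row and column margins for the independence model. This is a textbook Dirichlet-multinomial construction, not necessarily the exact prior used by the authors, and the function names are hypothetical.

```python
from math import lgamma, log

def log_dirichlet_multinomial(counts, alpha=1.0):
    """Log marginal likelihood of an observation sequence with the given
    category counts under a symmetric Dirichlet(alpha) prior (assumed prior)."""
    k = len(counts)
    n = sum(counts)
    return (lgamma(k * alpha) - lgamma(n + k * alpha)
            + sum(lgamma(c + alpha) - lgamma(alpha) for c in counts))

def margins(table):
    """Row sums and column sums of a contingency table given as a list of rows."""
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    return rows, cols

def log_bayes_factor(table, alpha=1.0):
    """Log Bayes factor: dependence model (free cell probabilities) versus
    independence model (product of row and column margin probabilities)."""
    cells = [c for row in table for c in row]
    rows, cols = margins(table)
    return (log_dirichlet_multinomial(cells, alpha)
            - log_dirichlet_multinomial(rows, alpha)
            - log_dirichlet_multinomial(cols, alpha))

def mutual_information(table):
    """Empirical mutual information (in nats) between the two margins."""
    rows, cols = margins(table)
    n = sum(rows)
    mi = 0.0
    for i, row in enumerate(table):
        for j, c in enumerate(row):
            if c > 0:
                mi += c / n * log(c * n / (rows[i] * cols[j]))
    return mi

# Illustrative tables (made-up counts, not data from the paper).
dependent = [[40, 5], [5, 40]]
independent = [[25, 25], [25, 25]]
large = [[40000, 5000], [5000, 40000]]
```

In this sketch the log Bayes factor is positive for the `dependent` table and negative for the `independent` one, and for the `large` table the log Bayes factor divided by the sample size is close to the empirical mutual information, in line with the large-data behavior stated in the abstract.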

Similar resources

Associative Clustering (AC): Technical Details

This report contains derivations which did not fit into the paper [3]. Associative clustering (AC) is a method for separately clustering two data sets when one-to-one associations between the sets, implying statistical dependency, are available. AC finds Voronoi partitionings that maximize the visibility of the dependency on the cluster level. The main content of this paper are technical result...

A Comparative Study of Issues in Big Data Clustering Algorithm with Constraint Based Genetic Algorithm for Associative Clustering

Clustering can be defined as the process of partitioning a set of patterns into disjoint and homogeneous meaningful groups, called clusters. The growing need for distributed clustering algorithms is attributed to the huge size of databases that is common nowadays. The task of extracting knowledge from large databases, in the form of clustering rules, has attracted considerable attention. Distri...

Integration of Transcription Factor Binding and Gene Expression by Associative Clustering

We integrate paired genomic data sets to reveal their dependencies. We suggest using a dependency-maximizing clustering method for the task. The recently introduced method associative clustering (AC) finds groupings of genes for which the two data sources are maximally dependent. The dependencies between data sources become represented as a contingency table, which is optimized to reveal the as...

Using Supervised Clustering Technique to Classify Received Messages in 137 Call Center of Tehran City Council

Supervised clustering is a data mining technique that assigns a set of data to predefined classes by analyzing dataset attributes. It is considered as an important technique for information retrieval, management, and mining in information systems. Since customer satisfaction is the main goal of organizations in modern society, to meet the requirements, 137 call center of Tehran city council is ...



Publication date: 2003